Audience: Diverse Background
Time: 1 day workshop (6 hours)
Pre-Requisites: Prior experience with the python programming language is essential: this is not an Introduction to Python. Basic competency is assumed. If you have not use python before consider taking: Intro to Python (Data Science Campus) or data camp courses prior to attending.
Brief Description: Natural Language Processing is a sub-field of Artificial Intelligence. It is used for processing and analysing large amounts of natural language (linguistics). Some applications include search engines (Google), text classification (spam filters), identifying sentiments for a product (sentiment analysis), methods for discovering abstract topics in a collection of documents (topic modelling) and machine translation technologies. This is an Introduction to Natural Language Processing, and thus the main concepts are about cleaning, exploring datasets, and applying feature engineering techniques to transform text data into numerical data.
Aims, Objectives and Intended Learning Outcomes: This module will provide a introduction to the Natural Language Processing field using Python programming language. It covers some basic terminology, the process of ‘cleaning’ a dataset, exploring it and applying simple feature engineering techniques to transform the data. By the end of the module learners will understand and apply the necessary steps to ‘clean’, explore and transform their dataset in the appropriate order.
Dataset: Patent Dataset, Hep Dataset (High_Energy_Physics), Spam/Ham
Libraries: Before attending the course please make sure that you read the course instructions that you received.
Acknowledgements: Many thanks to Savvas Stephanides for joining me on a pair programming approach to create the function that performs the text preprocessing and for his code review. Many thanks to Joshi Chaitanya that has provided Hep Dataset and some of his code for this course, to Ian Grimstead and Thanasis Anthopoulos for providing the Patent Dataset, to Gareth Clews, Isabela Breton and Dan Lewis for reviewing the material and the code and Dave Pugh for lending the Regex material. Also thanks to everyone who attended the pilot course to provide feedback about the course.
Approach: Use one algorithm for each topic and mention the rest and where they can find them.
Topic Modelling what, why, where, when, who, how, so what, advantages, disadvantages, limitations
Topic Modelling is a method for discovering topics in a document collection. Topic is a collection of dominant keywords. Looking at the keywords, you can identify the topic.
There are many algorithms used for Topic Modelling, such as latent semantic analysis (LSA), Probabilistic Latent Semantic Analysis (pLSA), LDA (Bayesian version of pLSA), lda2vec (extension of word2vec and LDA).
In this course we will only focus on Latent Dirichlet Allocation (LDA) from the gensim library, which seems to be the most popular one.
When using LDA, each document in a set of documents is considered as a mixtrure of various topics. These topics are assigned to the document via the LDA. The assumption is that the topic distribution has a sparse Dirichlet prior, which includes the intuition that documents include a small set of topics and topics include a small set of frequent words. Topics are identified based on the likelihood of having co-occurrent terms.
The quality of the model depends on the text pre-processing, the choice of the algorithm, the number of topics we choose, the variety of the text topics and the parameter tuning.
Steps: change the Steps. make 3, 8, 3, 4, 5, 6, 6, 9, 10 , 11
Create a dictionary using corpora.Dictionary
Create the corpus and get the frequencies of the words using doc2bow
Build and Train the LDA model using gensim.models.ldamodel.LdaModel
View and Interpret Topics using print_topics
Compute model perplexity using log_perplexity. Perplexity is a measure of how well the model can predict the observed values in our sample. We can estimate perplexity for different LDA models. It is considered that the model with the lowest Perplexity performs better in predicting the observed values.
Compute model coherence using CoherenceModel. Coherence Score is a measure of how well the topic model can be humanly-intrepreted (= assessing the quality of the learned topics). A good model will provide coherent topics. The higher the coherence is, the better the topics are extracted.
Visualize topics and keywords using pyLDAvis.gensim.prepare
Find the optimal number of topics by building many LDA models with different values of number of topics and pick the one that gives the highest coherence value.
Find the dominant topic in document
Find the most representative document for each topic
Find the topic distribution across documents
Intended Learning Outcomes: By the end of Chapter 1 you will be able to:-
Define key terms in Information Retrieval (IR).
List at a high level of abstraction key steps in developing an IR application.
Describe how IR can be challenging
Describe an Inverted Index
Set up an inverted index for a document collection in Python using SCI Learn
Define 3 models used to build an IR application
Describe the Boolean Retrieval Model
Set up a Boolean Retrieval search over a document collection
Describe VSM approach to IR
Set up a VSM based IR program over a document collection
Describe Language Modelling approach to IR
Calculate maximum liklihood estimates for terms in a document collection.
Apply Linear Interpolation to query/document to determine at probability score for query/document.
The meaning of the term Information Retrieval can be quite broad.
Every time you look up information to get a task done could be considered IR.
A useful definition given by Manning (2009)
IR is finding material (usually documents) of an unstructured nature (usually text) that satifies an information need from within larger collections (usually stored on computers)
Key Terms used in Information Retrieval
An information need is the topic about which the user desires to know more about.
A query is what the user conveys to the computer in an attempt to communicate the information need.
A document is an information entity the user wants to retrieve.
A document is relevant if the user perceives that it contains information of value with respect to their personal information need.
A collection is a set of documents.
A term is a word or concept that appears in a document or query
An index is a representation of information that makes querying easier
Information Retrieval vs Web Search
IR is more than web search
IR is concerned with the finding of (any kind of) relevant information
Up until a few decades ago, people preferred to get information from other people eg booking travel via a human travel agent, librarians to search for books, paralegals etc. It used to be an activity only a few people engaged in.
The world has changed, hundreds of millions of peope engage in information retrieval (IR) every data through web search. However many other cases of IR eg email search, searching your laptop, interrogating corporate knowledge bases are also commonplace examples of search.
Information retrieval has overtaken database retrieval as most information does not reside in database systems.
Related to the above are the following issues:
Questions to tackle in retrieval
The task in information retrieval is this: we have vast amounts of information to which accurate and speedy access is becoming ever more difficult. One effect of this is that relevant information gets ignored since it is never uncovered, which in turn leads to much duplication of work and effort. With the advent of computers, a great deal of thought has been given to using them to provide rapid and intelligent retrieval systems. The idea of relevance has slipped into the discussion. It is this notion which is at the centre of information retrieval. The purpose of an automatic retrieval strategy is to retrieve all the relevant documents at the same time retrieving as few of the non-relevant as possible. An IR system should generate a ranking which reflects relevance.
Most search engines use bag of words to build retrieval models. The document is treated as a bag of words
Basic Concept: Each document is described by a set of representative keywords called index terms.
Assign a numerical weight to index terms
The above index is often represented as a dictionary file of terms with an associated postings file.
This inverted index structure is essentially without rivals as the most efficient structure for supporting ad hoc text search.
inverted_index_example = ["He likes to wink, He likes to drink!", "He likes to drink, and drink, and drink.", "The thing he likes to drink is ink","The ink he likes to drink is pink","He likes to wink, and drink pink ink" ]
def set_tokens_to_lowercase(data):
for index, entry in enumerate(data):
data[index] = entry.lower()
return data
def remove_punctuation(data):
symbols = ",.!"
for index, entry in enumerate(symbols):
for index2, entry2 in enumerate (data):
data[index2] = re.sub(r'[^\w]', ' ', entry2)
return data
def remove_stopwords_from_tokens(data):
stop_words = set(stopwords.words("english"))
new_list = []
for index, entry in enumerate(data):
no_stopwords = ""
entry = entry.split()
for word in entry:
if word not in stop_words:
no_stopwords = no_stopwords + " " + word
new_list.append(no_stopwords)
return new_list
inverted_index_example = remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(inverted_index_example)))
vectorizer = CountVectorizer()
inverted_index_vectorised = vectorizer.fit_transform(inverted_index_example)
#if u want to look at it
tdm = pd.DataFrame(inverted_index_vectorised.toarray(), columns = vectorizer.get_feature_names())
print (tdm.transpose())
## 0 1 2 3 4
## drink 1 3 1 1 1
## ink 0 0 1 1 1
## likes 2 1 1 1 1
## pink 0 0 0 1 1
## thing 0 0 1 0 0
## wink 1 0 0 0 1
The following can be said about the inverted index:-
• It maps terms to the documents that contain them. It “inverts” the collection (which maps documents to the words they contain)
• It permit us to answer boolean queries without visiting entire corpus
• It is slow to construct (requires visiting entire corpus) but this only needs to be done once
• It can be used for any number of queries
• It can be done before any queries have been seen
Exercise:
Doc 1: New home sales top forecasts.
Doc 2: Home sales rise, in July!
Doc 3 Increase in home sales, in July.
Doc 4 July new home sales rise.
Optional Extra:
Find documents matching query “pink ink”
2. Both words has to be a phrase
We could have a bi-gram index
Bi-gram index issues:
Fast but index size will explode
What aboout trigram phrases
What about proximity? “ink is pink”
A possible solution: Proximity Index
Term position is embedded to the inverted index
Called proximity/positional index
Enables phrase and proximity search
Implement positional inverted index on data shown below.
You need to save the following information in terms inverted lists:
- term (pre-processed) and its document frequency
- list of documents where this term occured
- for each document, list of positions where the term occured within the document
Doc 1: breakthrough drug for schizophrenia
Doc 2: new schizophrenia drug
Doc 3: new approach for treatment of schizophrenia
Doc 4: new hopes for schizophrenia patients
For effectively retrieving relevant documents by IR strategies, the documents are typically transformed into a suitable representation. Each retrieval strategy incorporates a specific model for its document representation purposes.
A retrieval model specifies the details of:
• Document representation
• Query representation
• Retrieval function: how to find relevant results
• Determines a notion of relevance
In classical IR models a document is described as a set of representative keywords - index terms . Each term is assigned a numerical weight to determine relavance.
The simplest form of document retrieval is for a computer to do this sort of linear scan through documents. This process is commonly GREP referred to as grepping through text, after the Unix command grep, which performs this process. However,searching through large collections (billions to trillions of words) is unacceptably slow. More flexible matching operations require ranked retrieval.
One alternative to linearly scanning is to index the documents in advance.
Suppose we record for each document – here a play of Shakespeare’s – whether it contains each word out of all the words Shakespeare used (INCIDENCE MATRIX about 32,000 different words). The result is a binary term-document incidence matrix, as in Figure. Terms that are indexed are usually words.
We can have a vector for each term, which shows the documents it appears in, or a vector for each document, showing the terms that occur in it. To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND: 110100 AND 110111 AND 101111 = 100100.
The answers for this query are thus Antony and Cleopatra and Hamlet.
Question
What are the returned results for query
data = ["He likes to wink, He likes to drink!", "He likes to drink, and drink, and drink.", "The thing he likes to drink is ink","The ink he likes to drink is pink","He likes to wink, and drink pink ink" ]
data = remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(data)))
binary_vectorizer = CountVectorizer(binary=True)
counts = binary_vectorizer.fit_transform(data)
#if u want to look at it
tdm = pd.DataFrame(counts.toarray(), columns = binary_vectorizer.get_feature_names())
tdm=tdm.transpose()
print (tdm)
## 0 1 2 3 4
## drink 1 1 1 1 1
## ink 0 0 1 1 1
## likes 1 1 1 1 1
## pink 0 0 0 1 1
## thing 0 0 1 0 0
## wink 1 0 0 0 1
def NOT(pterm):
for a in range(len(pterm)):
if(pterm[a] == 0):
pterm[a] = 1
elif(pterm[a] == 1):
pterm[a] = 0
return pterm
term1 = list(tdm.loc['drink'])
term2 = list(tdm.loc['ink'])
term3 = NOT(list(tdm.loc['pink']))
terms = list(zip(term1, term2, term3))
vector= [terms[item][0] & terms[item][1] & terms[item][2]for item in range(len(terms))]
for i in vector:
if i == 1:
print ("Document", vector.index(i), "meets search term criteria")
## Document 2 meets search term criteria
The Boolean retrieval model is a model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR, and NOT. The model views each document as just a set of words
The following can be said of the Boolean Retrieval Model:-
• It can answer any query that is made up of boolean expressions
• Boolean queries are queries that use and, or and not to join query terms
• Views each document as a set of terms
• It is precise - document matches conditions or not
• Primary commercial retrieval tool for 3 decades
• Many professional searchers (e.g., lawyers) still like boolean queries
• You know exactly what you are getting
• It does not have a built-in way of ranking matched documents by some notion of relevance
• It is easy to understand. Clean formalism
• It is too complex for web users
• Incidence matrix is impractical for big collections
Exercise:
Consider these documents:
Doc 1: breakthrough drug for schizophrenia
Doc 2: new schizophrenia drug
Doc 3: new approach for treatment of schizophrenia
Doc 4: new hopes for schizophrenia patients
The representation of a set of documents as vectors in a common vector space is known as the vector space model. Every distinct word has one dimension.
Key idea: Documents and queries are vectors in a high-dimensional space.
Key issues:
• What to select as the dimensions of this space?
• How to convert documents and queries into vectors?
• How to compare queries with documents in this space?
The Vector Space Model assumes that
• the degree of matching can be used to rank-order documents;
• this rank-ordering corresponds to how well a document satisfying a user’s information need
Steps in Vector Space Modelling
• Convert the query to a vector of terms
• Weight each component.
• Consult the index to find all documents containing each term
• Convert each document to a weighted vector
• Query and documents mapped to vectors and their angles compared
• Match the query vector against each document vector and sort the documents by their similarity
• Similarity based on occurrence frequencies of keywords in query and document
• Output documents are ranked according to similarity to query
Challenges
• Finding a good set of basis vectors.
• Finding a good weighting scheme for terms, since model provides no guidance.
Usually variations on (length normalised) tf*idf • Finding a comparison function, since again the model provides no guidance. Usually cosine comparison.
Comments on Vector Space Models
• Simple, practical, and mathematically based approach
• Lacks the control of a Boolean model (e.g., requiring a term to appear in a document)
Overall, Vector Space Models are hard to beat
Consider below documents and a query term
Document 1: Cat runs behind rat
Document 2: Dog runs behind cat
Query: rat
A term document matrix would be set up. This is a way is a way of representing documents vectors
in a matrix format in which each row represents term vectors across all the documents and columns represent document vectors across all the terms.
Term weights are calculated for all the terms in the matrix across all the documents.
A word which occurs in most of the documents might not contribute to represent the document relevance whereas less frequently occurred terms might define document relevance. This can be achieved using a method known as term frequency - inverse document frequency (tf-idf) which gives higher weights to the terms which occurs more in a document but rarely occurs in all other documents, lower weights to the terms which commonly occurs within and across all the documents. Tf-idf = tf X idf
Similarity Measures: cosine similarity
Mathematically, closeness between two vectors is calculated by calculating the cosine angle between two vectors. The cosine angle between each document vector and the query vector is calculated to find its closeness. To find relevant document to the query term , the similarity score between each document vector and the query term vector by is calculated by applying cosine similarity . Whichever documents have a high similarity score will be considered as relevant documents to the query term.
Summary on VSM
The IATI dataset will be used, further details on this dataset can be found here https://iatistandard.org/en/iati-standard/ The dataset used below is a subset which provides description on aid activity undertaken by various organisation in the aid sector around the world.
import operator
import pandas as pd
import re
import sklearn
from sklearn.decomposition import PCA
from sklearn import feature_extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk import ngrams
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.corpus import brown
from nltk.collocations import *
from nltk.corpus import webtext
import numpy as np
import random
import pickle
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
def set_tokens_to_lowercase(data):
for index, entry in enumerate(data):
data[index] = entry.lower()
return data
def remove_punctuation(data):
symbols = ",.!"
for index, entry in enumerate(symbols):
for index2, entry2 in enumerate (data):
data[index2] = re.sub(r'[^\w]', ' ', entry2)
return data
def remove_stopwords_from_tokens(data):
stop_words = set(stopwords.words("english"))
new_list = []
for index, entry in enumerate(data):
no_stopwords = ""
entry = entry.split()
for word in entry:
if word not in stop_words:
no_stopwords = no_stopwords + " " + word
new_list.append(no_stopwords)
return new_list
def stemming (data):
st = PorterStemmer()
for index, entry in enumerate(data):
data[index] = st.stem(entry)
return data
def read_data():
raw_data_orig = open("C:/IR Course/Adv -IR/IATI3.pkl","rb")
raw_data_orig = pickle.load(raw_data_orig, encoding='iso-8859-1')
raw_data_orig = raw_data_orig[raw_data_orig['description'].notnull()]
return raw_data_orig
query ="climate change and environmental degradation"
def preprocess(pdf):
for index, row in pdf.iterrows():
row['description'] = " ".join(stemming(remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(row['description'].split(" "))))))
return pdf
#preprocess documents
raw_data= preprocess(read_data())
#now preprocess query
query = " ".join(stemming(remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(query.split(" "))))))
rownames = raw_data["iati-identifier"]
#vectorise and get tfidf values
vectorizer = TfidfVectorizer()
vectorized_iati = vectorizer.fit_transform(raw_data["description"])
tdm = pd.DataFrame(vectorized_iati.toarray(), columns = vectorizer.get_feature_names())
tdm=tdm.set_index(rownames)
#now vectorise query
vectorized_query=vectorizer.transform(pd.Series(query))
query = pd.DataFrame(vectorized_query.toarray(), columns = vectorizer.get_feature_names())
# get cosine similarity
def cos_sim (pdf, qdf):
f_similarity={}
for index, row in qdf.iterrows():
for index2, row2 in pdf.iterrows():
cos_result = cosine_similarity(np.array(row).reshape(1, row.shape[0]), np.array(row2).reshape(1, row2.shape[0]))
f_similarity[index2] = round(float(cos_result),5)
return f_similarity
cosine_scores=cos_sim (tdm, query)
#now rank
final_rank= sorted(cosine_scores.items(), key=operator.itemgetter(1), reverse=True)
final_rank = final_rank[0:5]
rownames = rownames.tolist()
unprocessed = read_data()
for item in final_rank:
if item[0] in rownames:
print('IATI-IDENTIFIER {0} DESCRIPTION {1}'.format(item[0],unprocessed.iloc[rownames.index(item[0]),2]))
## IATI-IDENTIFIER 41AAA-11960-005 DESCRIPTION UNOPS supports the Global Environment Facility (GEF) Small Grants Programme that helps protect poor, remote villages from the serious effects of climate change and environmental degradation. In an effort to support community-led initiatives, UNOPS efficiently channels direct grants to help communities cope with climate change, conserve biodiversity, protect international waters, reduce the impact of persistent organic pollutants, prevent land degradation, and adopt sustainable forest management practices.
## IATI-IDENTIFIER 41120-8599 DESCRIPTION vOverall Objective:Residents of cities in developing countries and their urban systems begin to become more resilient to the impacts and climate change, and reduce their carbon emissions.We intend to achieve this objectivethrough the followingExpected Accomplishments (EAs):EA1:The urban dimension is introduced into climate change agreements, strategies, policies, laws and regulations, and the climate change dimension is introduced into urban strategies, policies, laws and regulationsEA2:Urban decision-makers and stakeholders have increased capacity to reduce greenhouse gas emissions and adapt to climate change, and the institutions that build their capacities have adapted their teaching curricula and research accordinglyEA3:Cities participating in CCCI develop and begin to implement pro-poor strategies to adapt to climate change and embrace low carbon growth trajectoriesEA4:A global network of partners begins to advocate for policies to help cities better address climate change, access climate finance, and manage knowledge.The underlying development hypothesis here that relates the EAs to our Overall Objective is that together policy change and capacity building will lead to changes in behaviour. For example, policy reform at either the national or the local level (EAs Nos. 1 and 3, respectively), coupled with adequate institutional and human capacity (addressed in EA 2), will lead to actual reduced emissions from cities (part of Overall Objective). For further discussion of our strategy for implementing individual expected accomplishments, please see Section 2 of the CCCI Consolidated Strategy (May 2013), developed for discussion with the CCCI External Advisory Committee (available upon request).CCCI has followed roughly this logical framework for the past several years. However in recent years sharp cuts in annual replenishments, coupled with the increased visibility of cities in international processes with related demands on UN-Habitat, have tended to reduce resources available for city-level work (EA 3).
## IATI-IDENTIFIER 41119-GW-REGULAR-S12-UNFPA DESCRIPTION UNFPA Guinea-Bissau regular-funded Activities to strengthen national capacity for production and dissemination of quality disaggregated data on population and development issues that allows for mapping of demographic disparities and socio-economic inequalities, and for programming in humanitarian settings activities implemented by UNFPA
## IATI-IDENTIFIER 41119-SS-OTHER-S10-NGO DESCRIPTION UNFPA South Sudan other-funded Activities to increase capacity to prevent gender-based violence and harmful practices and enable the delivery of multisectoral services, including in humanitarian settings activities implemented by NGO
## IATI-IDENTIFIER 41AAA-21461-001 DESCRIPTION Mejoramiento de la infraestructura de dos escuelas agrÃcolas
Exercise:
From the IATI10k.csv file, extract a sample of records (for example 100 rows) then do the following:
1. Put the description column through pre-processing. Make a decision on what preprocessing routines would be suitable.
2. Set up a suitable query to interrogate the document collection.
3. Construct the inverted index with tf-idf scores. Ensure that the query is has also been converted to a vector with tf-idf scores.
4.Compare the query vector with all other vectors in the document collection and calculate cosine similarity. Then store in a dictionary the iati-identifer field as a key with the cosine score as a value in a Python dictionary.
5. Rank the dictionary by cosine scores (value field in the dictionary) and print the top 10 scores (sort in ascending value).
Use probability to determine relevance. How well does a document satisfy the query ?
An IR sytem has an uncertain understanding of the user query and makes an uncertain guess of whether a document satisifes the query.
Probability theory provides a principled foundation for such reasoning under uncertainty
The query and the documents are all observations from random variables. In the vector-based models, we assume they are vectors, but here we assume they are the data observed from random variables
And so, the problem of retrieval becomes to estimate the probability of relevance
In this category of models, there are different variants.
Classical probabilistic retrieval models
Binary Independence Model
Okapi BM25
Bayesian networks for text retrieval
Language model approach to IR
In query likelihood, our assumption is that this probability of relevance can be approximated by the probability of query given a document and relevance.
How do we compute this conditional probability?
This is where we build a Language Model.
What is a language model ?
** “The goal of a language model is to assign a probability to a sequence of words by means of a probability distribution” ** –Wikipedia
To understand what a language model, must know what is a:
• probability distribution
• discrete random variable
In a unigram language model we estimate (and predict) the likelihood of each word independent of any other word
Defines a probability distribution over individual words
Sequences of words can be assigned a probability by multiplying their individual probabilities:
** P(university of north carolina) = P(university) x P(of) x P(north) x P(carolina) = (2/20) x (4/20) x (2/20) x (1/20) = 0.0001 **
There are two important steps in language modeling
‣ estimation: observing text and estimating the probability of each word
‣ prediction: using the language model to assign a probability to a span of text.
General estimation approach:
‣ tokenize/split the text into terms
‣ count the total number of term occurrences (N)
‣ count the number of occurrences of each term (tft)
‣ assign term t a probability equal to
• Suppose we have a document D, with language model
• We can use this language model to determine the probability of a particular sequence of text
• How? We multiple the probability associated with each term in the sequence!
Question:
What is the probability given by this language model to the sequence of text “rocky is a boxer” or “a boxer is a pet”?
To summarise how is the document model estimated for each document?
• Objective: rank documents based on the probability that they are on the same topic as the query
• Solution:
‣ Score each document (denoted by D) according to the probability given by its language model to the query (denoted by Q)
‣ Rank documents in descending order of score
Every document in the collection is associated with a language model
• Let denote the language model associated with document D
• Think of a “black-box”: given a word, it outputs a probability
Let P(t|θD) denote the probability given by to term t
Question:
Which would be the top-ranked document and what would be its score?
P(q|M1) > P(q|M2)
There are (at least) two issues with scoring documents based on query terms
A document with a single missing query-term will receive a score of zero (similar to boolean AND)
• Where is IDF?
• Na attempt should be made to suppress the contribution of terms that are frequent in the document and also frequent in general (appear in many documents)?
• The goal of smoothing is to …
‣ Decrease the probability of observed outcomes
‣ Increase the probability of unobserved outcomes
Add One Smoothing
A more effective approach to smoothing for information retrieval is called linear interpolation
Let denote the language model associated with document D
• Let denote the language model associated with the entire collection
• Using linear interpolation, the probability given by the document language model to term *t is:
As before, a document’s score is given by the probability that it “generated” the query
• As before, this is given by multiplying the individual query-term probabilities
• However, the probabilities are obtained using the linearly interpolated language model
Without smoothing, the query-likelihood model ignores how frequently the term occurs in general!
import nltk
import sys
import codecs
import nltk
from nltk.corpus import stopwords
import csv
import pandas
import re
import numpy as np
df = pandas.read_csv('C:/IR Course/Adv -IR/IATI10k.csv', header = 0, encoding="iso-8859-1")
df = df[df.description.notnull()]
def set_tokens_to_lowercase(data):
for index, entry in enumerate(data):
data[index] = entry.lower()
return data
def remove_punctuation(data):
symbols = ",.!"
for index, entry in enumerate(symbols):
for index2, entry2 in enumerate (data):
data[index2] = re.sub(r'[^\w]', ' ', entry2)
data[index2] = entry2.strip()
return data
def remove_stopwords_from_tokens(data):
stop_words = set(stopwords.words("english"))
stop_words.add(" ")
new_list = []
for index, entry in enumerate(data):
if entry not in stop_words:
new_list.append(entry)
return new_list
def clean_df(pdf):
for index, row in pdf.iterrows():
row['description'] = remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(row['description'].split())))
row['description'] = " ".join(x for x in row['description'])
return pdf
def calc_docscore(pdf, pqry):
col_names = ['Description', 'score']
f_df2 = pandas.DataFrame(columns = col_names)
for index, row in pdf.iterrows():
rank = []
docscore = 0
scored = score(row['description'])
for word in pqry.split(" "):
if word in scored.keys():
rank.append(float(scored[word] )+ float(allcounts[word]/total)/2)
if rank != []:
docscore = np.prod(np.array(rank))
f_df2.loc[index] = pandas.Series({'Description':row['description'], 'score':docscore})
return f_df2
def score (pstr):
fdict = {}
flist = pstr.split()
fdict = dict(nltk.FreqDist(flist))
for key, value in fdict.items():
fdict[key] = round(fdict[key]/len(flist),2)
return fdict
df = clean_df(df)
qry = "reduce transmission of HIV"
qry= remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(qry.split())))
qry = " ".join(x for x in qry)
allcounts = {}
for descript in df['description']:
tmp = dict(nltk.FreqDist(descript.split()))
for key, value in tmp.items():
if key not in allcounts:
allcounts[key] = value
else:
allcounts[key] = allcounts[key] + value
total = sum(allcounts.values())
df2=calc_docscore(df, qry)
df2sort_by_score = df2.sort_values('score', ascending=False)
print (df2sort_by_score[1:20])
## Description score
## 136 goal project contribute reduction hiv incidenc... 0.121396
## 180 sa school-based sexuality hiv prevention educa... 0.121396
## 142 hiv prevention treatment professional sex work... 0.121396
## 143 hiv prevention treatment professional sex work... 0.121396
## 360 ?improving diabetes care prevention piloting h... 0.121396
## 167 reduce hiv/aids prevalence prevention identify... 0.120252
## 9537 unops helps procure retroviral drugs hiv progr... 0.101396
## 311 pace uganda sub mildmay hiv related project fu... 0.101396
## 211 objective program increase use evidence-based ... 0.101396
## 319 ?psi subreceipent project hope usaid award eth... 0.101396
## 281 sfh south africa/psi sub tb hiv care award hts... 0.0913958
## 65 psi kenya helping 3ie build evidence towards u... 0.0913958
## 38 overall goal reduce mortality morbidity due ma... 0.0902525
## 114 expand strengthen high quality community-based... 0.0813958
## 266 goal program scale-up strengthen delivery qual... 0.0813958
## 242 increased access universal hiv prevention serv... 0.0813958
## 316 using user-centered approaches increase adopti... 0.0813958
## 149 global fund project funded zimbabwe ministry h... 0.0802525
## 249 hiv identification treatment program lesotho -... 0.0713958
Exercises:
Suppose the document collection contains two documents:
\(d_1\): Xyzzy reports a profit but revenue is down
\(d_2\): Quorus narrows quarter loss but revenue decreases further
The query is: “revenue down”
Calculate maximum liklihood estimates for terms in document 1 and document 2.
Apply the linear interpolation and calulate the score for query/document 1 and query/document 2.
D1: He likes to wink, he likes to drink
D2: He likes to drink, and drink, and drink
D3: The thing he likes to drink is ink
D4: The ink he likes to drink is pink
D5: He likes to wink, and drink pink ink
Query: “drink pink ink”
Write Python code to do the following:
Elasticsearch is a real-time distributed and open source full-text search and analytics engine.
It is accessible from RESTful web service interface and stores documents in JSON (see example https://json.org/example.html) format.
It is built on Java programming language and hence Elasticsearch can run on different platforms.
It enables users to explore very large amount of data at very high speed.
At heart it uses an inverted index as shown in Figure X. It maps terms to documents (and possibly positions in the documents) containing the term.
An index is a collection of documents, and a shard is a subset thereof. Documents are scored using tf-idf calculations.
To minimize index sizes, various compression techniques are used. For example, when storing the postings (which can get quite large).
Updating the index in elastic search is a delete followed by a re-insertion of the document. This keeps the data structures small and compact at the cost of an efficient update.
When new documents are added (perhaps via an update), the index changes are first buffered in memory.
Eventually, the index files in their entirety, are flushed to disk.
Data Volume
Amount of digital data in 2007: 281 exabytes = 281 trillion digitized novels
“Every 2 days now, we create as much information as we did from the dawn of civilisation up until 2003”
Eric Schmidt
Vocabulary mismatch problem due to synonymy and polysemy.
The same word has different meanings.
A search engine might not be able to guess the right meaning if appropriate contexts are not provided.
IR-systems are as good as the query provided to them.
Queries are provided by the human, and human is the weak link in this chain.
So, high quality query is a must. With a very bad query, you can defeat any search engine.
A search query is : Windows
For the query, a search engine (like- Google) can show results of three types as following:
Computer OS: Windows
Windows of buildings
Combination of both (1) and (2)
It is not the intention of a search engine to provide results of type 3, i.e., combined results of Windows of OS and buildings.
Because, a user, who works in a building company, might not want Computer OS Windows as the output of the query.
The output should be building windows for this type of users.
On the other-hand, similarly. another user working as a software engineer, should get the output of Windows OS for the query.
This type of query results based on person’s interests is called personalized search engines.
It is one of the most challenging sides of Information Retrieval (IR) to provide results based on person’s interests and ranked the results accordingly.